Introduction

The 2016 election produced one of the biggest electoral upsets in recent history: the Republican candidate pulled off a surprising win over his more favored Democratic opponent. For this project, we seek to provide a comprehensive debriefing of the 2016 election through a presentation of demographic and past election trends.

We first collect and compile county-level election and census data, cleaning the data and reconciling any inconsistencies among the counties. Then we provide exploratory data analysis by looking for correlations between demography and favored parties. In the maps section, we represent the election results and examine trends, or breaks in trends, on a year-by-year basis.

Following these observations, we create two predictor functions, one using recursive partitioning and one using K-NN, that test the relevance of our data in predicting election results. Factors such as race, income level, and education prove informative, as usual, and boost the accuracy of our results. At the same time, however, they yield misclassifications in surprising areas when compared to the actual election results. These misclassifications are especially interesting given the atypical 2016 election season, and we explore and hypothesize about their potential underlying causes in the Discussion section.

Data Description

author: Kimberly Kao

There are several variables to compare with party vote: race, education level, language spoken at home, place of birth, and income levels. Because our census data is from 2010, we chose to use the data from the 2008 election, as its demographics are more likely to resemble the census than those of the 2004 or 2016 elections. We chose 2008 over 2012 because we have the absolute number of third-party votes for 2008. Since our party votes and other census variables were absolute numbers, we calculated proportions (numeric vectors) to compare different variables, and also used these to create factor variables tracking whether some group formed a majority in a county.
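As a minimal sketch of the proportion and majority-factor step described above, the following uses a toy table. All column names and numbers are hypothetical, and the two-party denominator for the Democratic share is an assumption rather than the report's confirmed definition.

```r
# Toy county table; every column name and value here is a hypothetical stand-in.
counties <- data.frame(
  county   = c("A", "B", "C"),
  totalPop = c(1000, 5000, 200),
  whitePop = c(800, 2000, NA),
  demVotes = c(150, 1600, 40),
  gopVotes = c(300, 900, 55)
)

# Convert absolute counts to proportions so counties of different sizes compare.
counties$propWhite <- counties$whitePop / counties$totalPop
# Assumption: Democratic share of the two-party vote.
counties$propDem <- counties$demVotes / (counties$demVotes + counties$gopVotes)

# Factor variable tracking whether a group forms a majority (NA stays NA).
counties$whiteMajority <- factor(counties$propWhite > 0.5,
                                 levels = c(FALSE, TRUE),
                                 labels = c("No white majority", "White majority"))

# Factor variable for the winning party in each county.
counties$winner <- factor(ifelse(counties$demVotes > counties$gopVotes, "Dem", "GOP"))
```

Keeping missing counts as NA, rather than treating them as zero, lets later plots drop those counties explicitly.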

There is a clear relationship between the proportions of white and black populations and the proportions of Democratic and GOP votes: counties with greater proportions of white populations tended to have a GOP win, while counties with greater proportions of black populations tended to have a Democratic win. One point to keep in mind is that 1,532 counties were missing values for black population, so that plot uses a significantly smaller dataset.

We also examined the relationship between education levels and party win. We separated counties into groups by whether their high school or college educated population was above average, using the national average high school graduation rate and college enrollment rate in 2008 as thresholds. Comparing counties with an above average high school educated proportion with those below average showed no definitive relationship with party vote, as both groups had a greater proportion of GOP wins. When we examined college education levels, we found that the above average group has a roughly even split between counties with GOP wins and counties with Democratic wins.

We then compared language spoken at home and place of birth with party vote, but these also do not provide a clear pattern. Comparing the proportion of the population with a middle class income ($35-150k) with party vote did not show a clear pattern either, but when we included another variable (whether the county had a white majority), we could see that counties with higher proportions of low income groups and no white majority tended to have greater proportions of Democratic votes.

Introduction

We create several plots for comparing different variables with party vote. We chose to use the data from election 2008 as the demographics in 2008 are more likely to be similar to the data collected from the 2010 census.

Exploring the distribution of Democratic and GOP votes

We create a quantile-quantile plot.

Analysis:

The bend in the curve indicates a difference in the two distributions. It appears that proportion of Democratic votes has a longer right tail compared to GOP. In the next few sections, we explore various variables and the difference in party vote.
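A quantile-quantile comparison of this kind can be sketched with base R's qqplot(). The vote shares below are simulated for illustration only and are not the report's data. The variable names are hypothetical.

```r
set.seed(1)

# Simulated county vote shares (illustrative only, not the election data).
propDem <- rbeta(500, shape1 = 2, shape2 = 4)  # right-skewed: longer right tail
propGOP <- rbeta(500, shape1 = 4, shape2 = 2)

# qqplot() sorts both samples and plots quantile against quantile.
qq <- qqplot(propGOP, propDem,
             xlab = "GOP vote proportion quantiles",
             ylab = "Democratic vote proportion quantiles",
             main = "Q-Q plot of party vote proportions")

# Reference line: points fall on it when the two distributions match.
abline(0, 1, lty = 2)
```

Points bending away from the dashed line are what signal a difference in shape between the two distributions.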

Exploring Race

We want to examine the relationship between race and party vote majority.

Here, we create three scatterplots: (1) comparing proportion of white population with proportion of Democratic vote; (2) comparing proportion of black population with proportion of Democratic vote; (3) taking voter turnout into account when comparing party win and white/black population for election 2008.

## Warning: Removed 6 rows containing missing values (geom_point).

## Warning: Removed 1532 rows containing missing values (geom_point).

Conclusion: The scatterplots indicate that counties with a larger white population have a larger proportion of GOP votes and a smaller proportion of Democratic votes.

We are missing 1532 values for black population, so those plots use a smaller subset of our data frame. From the plots created, we can see that counties with a larger black population have a smaller proportion of GOP votes and a larger proportion of Democratic votes. Voter turnout has little effect on the relationship between party vote and race.

Exploring Education Levels

Here, we want to know: Is high school graduation rate a factor? Is college education a factor?

http://www.edweek.org/ew/dc/2013/gradrate_trend.html National high school graduation rate in 2009-2010 was 74.7%. We use this rate in calculating whether a county has above average proportion of high school educated population.

https://www.washingtonpost.com/news/education/wp/2015/11/24/college-enrollment-rates-are-dropping-especially-among-low-income-students/ National college enrollment rate in 2008 was 69%. We use this rate in calculating whether a county has above average proportion of college educated population.
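The thresholding described above might look like the following sketch. The two national rates are the ones cited above; the county values and column names are hypothetical toy data.

```r
hsNationalRate      <- 0.747  # national high school graduation rate cited above
collegeNationalRate <- 0.69   # national college rate cited above

# Toy county data; proportions and winners are made up for illustration.
edu <- data.frame(
  propHS      = c(0.80, 0.70, 0.90, 0.60),
  propCollege = c(0.20, 0.75, 0.72, 0.10),
  winner      = factor(c("GOP", "Dem", "GOP", "GOP"))
)

# Flag counties whose educated proportion exceeds the national rate.
edu$aboveAvgHS      <- edu$propHS > hsNationalRate
edu$aboveAvgCollege <- edu$propCollege > collegeNationalRate

# Count counties by education group and winning party.
hsTable <- table(aboveAvgHS = edu$aboveAvgHS, winner = edu$winner)

# Side-by-side bars: one cluster per education group, one bar per party.
barplot(t(hsTable), beside = TRUE, legend.text = TRUE,
        col = c("blue", "red"),
        xlab = "Above average high school educated proportion",
        ylab = "Number of counties")
```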

We attempted to create scatterplots for: (1) Comparing the proportion of college educated people in the population above 25 years old with party vote, with Bachelor’s, Master’s, PhD, and higher degrees grouped together. (2) Comparing the proportion of people with at least a high school education in the population above 25 years old with party vote.

However, the scatterplots did not reveal any relationship, so we also create a side-by-side barplot for comparing number of voters in each education level and party vote.

## Warning: Removed 3 rows containing missing values (geom_point).

Initially we tried creating a scatterplot to compare college education and party vote. However, this plot showed no relationship, so we decided to use a side-by-side bar graph instead. The bar plot shows that counties with a below average proportion of high school educated population follow a similar pattern to counties with an above average proportion: both groups have more counties with a GOP win. Among counties with an above average proportion of college educated population, there is a roughly even split between GOP and Democratic wins.

Exploring Place of Birth

We attempted to find a relationship between place of birth and party vote using a scatterplot. However, the county profile data reveal little relationship, as most voters are native-born.

## Warning: Removed 3 rows containing missing values (geom_point).

Exploring Language Spoken at Home

We compare proportion of population with English proficiency with party vote.

  1. Comparing proportion of population with English as only language with party vote.
  2. Comparing proportion of Spanish speakers with party vote.
  • We use the population who speak Spanish at home and speak English less than very well.

However, the scatterplots indicate only a trivial relationship, as most voters primarily speak English at home.

## Warning: Removed 3 rows containing missing values (geom_point).

Exploring Income Levels

We create a side-by-side barplot to explore whether having a majority of the population in the middle class income range relates to a county’s party vote. We also create a scatterplot to compare the proportion of the population in the lower 2 income brackets ($10-25k) with the proportion of Democratic vote.

## Warning: Removed 3 rows containing missing values (geom_point).

Analysis: When we compare the proportion of population in the lower income ranges with party vote, colored by whether the county had a white majority population, we find that the counties without a white majority population had a larger proportion of Democratic vote.

Maps

author: Margaret Chen

This part of the code extracts information from the maps package and matches it with the final merged data frame that the team has created. To make matching easier, we also extracted the FIPS information from the maps package to match with the maps package counties. Using the FIPS code, we then merge the data frame created through the maps package with our final election/census/location data frame.
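A rough sketch of the FIPS-based merge follows. To stay self-contained it uses a small toy table shaped like the county.fips dataset in the maps package (columns fips and polyname) instead of loading the package. The election table, its column names, and its vote counts are hypothetical.

```r
# Toy stand-in shaped like maps::county.fips (columns: fips, polyname).
mapFips <- data.frame(
  fips     = c(1001, 1003, 48301),
  polyname = c("alabama,autauga", "alabama,baldwin", "texas,loving"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the merged election/census table; values are made up.
finalMerge <- data.frame(
  fips     = c(1001, 1003),
  demVotes = c(100, 200)
)

# Left join on the FIPS code so every map county is kept, matched or not.
merged <- merge(mapFips, finalMerge, by = "fips", all.x = TRUE)

# Flag map counties that found no partner in the election/census data.
merged$matched <- !is.na(merged$demVotes)
```

Keeping unmatched rows (all.x = TRUE) is what makes it possible to color missing counties on a map afterwards.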

Finding Missing Counties

Using the data frame created in the previous section, we can plot the county data using ggplot2 and geom_polygon(). The map is plotted with the fill aesthetic mapped to a factor, so that any county not matched with the rest of the data frame is colored in red. At the same time, to test that the longitude/latitude data makes sense, we’ve overlaid those points on the map. Some vectors are also provided with desired results next to them.

In addition, the list “missingValues” is also created to spot any NAs in each column of the finalMerge_maps data frame where there exists an id_mapCounty. In other words, we’re looking for counties (as defined by the maps package) that have missing information in terms of census data, election data, etc. A more detailed explanation of the result is provided below the code chunk.

## character(0)

The missingValues list contains mostly empty vectors, with a few exceptions. We do not worry about them for the following reasons. The precinct numbers for 2008 indicate that there is missing information for Washington D.C., but since D.C. is essentially a city, this information is naturally missing: it’s implied that, if D.C. is reported, “all” of its precincts are reported. On the other hand, census data appears to be entirely missing for three Texas counties: Kenedy, King, and Loving. According to their Wikipedia pages (cited at the end of this report), however, they are three of the least populous counties in the United States, with Loving County being the least populous at 82 people. For this reason, it is likely that certain demographic information becomes hard to calculate in the census. Within the census information, there are also 1527 missing values for the black population. Since African-Americans remain a minority in the United States, it is likely that many of these counties simply do not have significant black populations. The same goes for the entries that indicate NAs for white populations.
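One way such a missingValues list could be built is sketched below. The names finalMerge_maps, id_mapCounty, and missingValues come from the text above; the toy rows and the other column names are hypothetical.

```r
# Toy version of finalMerge_maps; only id_mapCounty is a name from the report.
finalMerge_maps <- data.frame(
  id_mapCounty = c("a,x", "a,y", NA, "b,z"),
  whitePop     = c(10, NA, 5, 7),
  blackPop     = c(NA, NA, 1, 2),
  stringsAsFactors = FALSE
)

# Keep only rows that correspond to a county known to the maps package.
mapped <- finalMerge_maps[!is.na(finalMerge_maps$id_mapCounty), ]

# For each column, list the map counties whose value is NA.
missingValues <- lapply(mapped, function(column) {
  mapped$id_mapCounty[is.na(column)]
})
```

A column with no gaps yields character(0), so empty vectors in the list indicate complete columns.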

Election Results by Year, Size of Relative Lead, Size of Absolute Lead

This section is dedicated to plotting each of the four election results by the size of lead relative to each county’s population and by the size of lead in terms of absolute votes.

There are two helper functions provided in the code to plot the data by relative lead and by absolute lead. The relative lead function takes in the size of the lead, normalized by each county’s population, and transforms it into a factor that is then plotted with ggplot. Each county is shaded in, with darker shades representing a relatively larger margin won (blue for Democrats, red for Republicans).

The absolute lead function, on the other hand, takes in the absolute margin won and a logical value indicating whether it is a Democratic win (marked by TRUE) or a Republican win (marked by FALSE). This map plots a blue or red circle, representing a Democratic or Republican win respectively, with the size of the circle corresponding to the size of the final margin. The names of the counties with the largest Republican margin and the largest Democratic margin are also provided in red and blue text, respectively.
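The quantities that these two helpers consume could be derived roughly as follows. This is a sketch on toy data: the column names and the bin boundaries for the relative-lead factor are assumptions, not the values used in the report.

```r
# Toy county results; names and numbers are hypothetical.
leads <- data.frame(
  county   = c("A", "B", "C"),
  totalPop = c(1000, 5000, 200),
  demVotes = c(150, 1600, 40),
  gopVotes = c(300, 900, 55)
)

# Absolute lead: size of the margin, plus a logical marking a Democratic win.
leads$absLead <- abs(leads$demVotes - leads$gopVotes)
leads$demWin  <- leads$demVotes > leads$gopVotes

# Relative lead: signed margin normalized by county population.
leads$relLead <- (leads$demVotes - leads$gopVotes) / leads$totalPop

# Bin the relative lead into a factor for shading (cut points are assumed).
leads$relLeadBin <- cut(leads$relLead,
                        breaks = c(-Inf, -0.1, 0, 0.1, Inf),
                        labels = c("Strong GOP", "Lean GOP", "Lean Dem", "Strong Dem"))
```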

Note that there should be an NA value for the Bedford City, Virginia data in 2016. The county/independent city was merged back to Bedford County in 2013. (source: https://en.wikipedia.org/wiki/Bedford,_Virginia)

Analysis of the maps is provided at the bottom of this code chunk.

Analysis

Democrats tend to have leads over Republicans in coastal areas and large cities. Republican support tends to be more spread out among the rural states, with fewer leads overall. For the 2016 election especially, there is a resurgence of Republican support in the counties along the Appalachian Mountains, support that had been present since 2004 but seemed to consolidate around 2012. Democrats, on the other hand, seem to be losing their leads in the Midwest, especially compared with the 2008 election and the Obama coalition.

The Absolute Lead maps also hint at some polarization in the United States, where counties that were previously Republican or Democratic gain ever-larger leads in their respective neighborhoods, with the possible exception of the 2008 election. The effects are most striking when comparing the 2004 election, for which there is a relatively sparse map, to the 2016 election, in which leads in highly Democratic and Republican counties have increased, as shown by the more crowded map. This may be part of the reason why our predictor functions were able to achieve a relatively high prediction rate.

Another potentially interesting point is the presence of third parties. Since the amount of third-party data is relatively small, a detailed analysis of each state is most likely not advisable. What is clear, however, is the steady rise in the portion of votes that third parties take up from 2008 to 2016 (2004 data is unavailable), which may have muddled our predictors and prevented a more accurate result. In addition, the combination of large portions of third-party votes and the polarization in the 2016 election may signal general discontent among Americans with the current state of the country.

Third Party

Predicting the 2016 Results (Christine and Eugene)

We used recursive partitioning with 2-fold cross validation to predict the 2016 results. A factor variable of Democrat or Republican was created to show the winning party. The other variables used to train our predictor are the 2012 election outcomes, Census 2010 data, and state names. We implemented 2-fold cross validation by state, meaning that we split the counties in each state: half the counties in the state are used as the training data and the other half as the testing data, and this is then repeated with the folds switched. Instead of a matrix, a list was used to hold the folds, to account for the fact that the folds will not all be the same size. After training the predictor, we calculate the classification rates of the cross-validated folds. These are plotted and the complexity parameter with the highest rate is chosen, which is .007. This complexity parameter is used to create the final tree structure. The 2016 data is then tested with this tree, and the classification rate is calculated to be about .91, which is better than on the training data. Another data frame was created to aid in looking at the nodes that each county landed in.
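The state-wise 2-fold cross validation over the complexity parameter can be sketched with rpart as below. The data are simulated, the variable names and the candidate cp grid are hypothetical, and the sketch omits the final step of testing on 2016 data.

```r
library(rpart)
set.seed(42)

# Simulated counties; names and the relationship to the winner are made up.
n <- 400
toy <- data.frame(
  state         = factor(sample(c("S1", "S2", "S3", "S4"), n, replace = TRUE)),
  propWhite     = runif(n),
  propLowIncome = runif(n)
)
toy$winner <- factor(ifelse(toy$propWhite + rnorm(n, sd = 0.15) > 0.5, "GOP", "Dem"))

# Split the counties of each state in half; a list lets the folds differ in size.
folds <- list(integer(0), integer(0))
for (rows in split(seq_len(n), toy$state)) {
  shuffled <- rows[sample.int(length(rows))]
  nHalf <- floor(length(rows) / 2)
  folds[[1]] <- c(folds[[1]], head(shuffled, nHalf))
  folds[[2]] <- c(folds[[2]], tail(shuffled, length(rows) - nHalf))
}

# For each candidate cp, train on one fold, test on the other, then average.
cps <- c(0.001, 0.007, 0.05, 0.2)
cvAccuracy <- sapply(cps, function(cp) {
  mean(sapply(seq_along(folds), function(f) {
    fit <- rpart(winner ~ state + propWhite + propLowIncome,
                 data = toy[-folds[[f]], ], method = "class",
                 control = rpart.control(cp = cp))
    pred <- predict(fit, toy[folds[[f]], ], type = "class")
    mean(pred == toy$winner[folds[[f]]])
  }))
})

# Refit on all rows with the best-scoring complexity parameter.
bestCp <- cps[which.max(cvAccuracy)]
finalTree <- rpart(winner ~ state + propWhite + propLowIncome, data = toy,
                   method = "class", control = rpart.control(cp = bestCp))
```

Splitting within each state keeps every state represented in both folds, so the state factor never meets an unseen level at prediction time.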

This methodology simulates the hypothetical situation of predicting the outcome before the 2016 election. Although all census variables were used to train, only a select few were actually used in the predictor: state name, language spoken at home, white population, black population, total population, $10-15k income, <$10k income, and place of birth. Our predictor was especially inaccurate for Maine because Obama won all but one county in Maine in 2012, but in 2016 many previously blue counties voted for Trump. In fact, because Obama won so overwhelmingly, our predictor predicted that all of Maine’s counties would go blue.

The State variable tends to be higher up in the tree, which makes sense since states are such overarching, distinct classifications. It is interesting to note that the 4 largest nodes are all Republican. Perhaps this parallels how Democrats typically come from more diverse demographics while Republicans are more demographically homogeneous.

## [1] TRUE
## [1] 15

## [1] 0.8961081
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Map K-NN Predictor

The following code maps out where the predictor function produced incorrect classifications.

Predicting the change from 2012 to 2016

Author: Kimberly Tze

To predict the change from 2012 to 2016, we used 2-fold cross validation and k-nearest neighbors. In order to classify the data, we created a factor variable that denotes whether the county changed from Republican to Democrat, changed from Democrat to Republican, stayed Republican, or stayed Democrat. To create the training set and the test set, for each state, ⅔ of the counties were randomly chosen to be in the training set and the remaining ⅓ were placed in the test set. The training set was then split in half and used in 2-fold cross validation to choose a value for k. In order to choose a good value for k, 2-fold cross validation was run using the 2 training sets and values of k ranging from 1 to 25, and then the accuracies of those predictions were plotted to see which value of k gave the most accurate predictions on average. Finally, after choosing a value of k, the predictor was evaluated with the test set. This whole algorithm was run several times to adjust the predictor variables and k as needed, until we found a combination that was reasonably accurate.

The final predictor variables used were the county’s longitude, latitude, proportion of white population, proportion of native born residents, proportion of people with an education level of high school graduate or less, proportion of lower income households ($0 - $34,999), and proportion of middle income households ($35,000 - $149,999). The last three of these predictor variables are aggregates of some of the other census variables. Because knn requires that no predictor variables be NA, the few counties with no census data were not included in the training or test sets.

With the final value of k (5) and the final set of predictor variables (above), the accuracy of predicting every county using the ⅔ size training set was about 86% on average.

Where this predictor does not work well is consistency. Because knn makes its predictions based on the training set, and the data in our training set are chosen randomly, the final accuracy will vary depending on which data made it into the training set. Even with a static training set and a static data set that we want to predict, the accuracy of knn can still vary slightly, since knn() breaks ties at random. We found that the predictor variables chosen were the best at predicting counties that stayed Republican (out of the four possible classifications). The K-NN predictor also did not work as well when predicting the classification for swing states.

Loading the data frame

Cleaning the data frame a little more

Because Bedford City, VA, merged back into Bedford County in 2013, there is no data for Bedford City from 2016. Thus, I decided to also merge Bedford City into Bedford County for 2012 (by adding the number of votes together), and then remove the row for Bedford City. Because some of the predictor variables we wanted to use were from the census data, we also had to subset out any counties that did not have any census data.
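A minimal sketch of folding the Bedford City votes into Bedford County, on a toy table with hypothetical column names and made-up vote counts:

```r
# Toy 2012 results; vote counts and column names are hypothetical.
votes2012 <- data.frame(
  county   = c("Bedford County", "Bedford City", "Other County"),
  state    = "VA",
  demVotes = c(100, 10, 50),
  gopVotes = c(200, 20, 60),
  stringsAsFactors = FALSE
)

isCity   <- votes2012$county == "Bedford City"
isCounty <- votes2012$county == "Bedford County"
voteCols <- c("demVotes", "gopVotes")

# Add the city's votes to the county's row, column by column.
votes2012[isCounty, voteCols] <- votes2012[isCounty, voteCols] +
  votes2012[isCity, voteCols]

# Drop the now-redundant city row.
votes2012 <- votes2012[!isCity, ]
```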

Choosing predictor variables

After choosing our predictor variables, I subsetted out any rows that have NAs in those variables, since knn cannot be run on data with missing values. Because most of the variables in the census data are absolute numbers, I also converted these to proportions, so the counties can more easily be compared (since some might have higher populations than others). For some of the census data, such as education, there are several different levels (ex. education less than 9th grade, high school education with no diploma, etc.), so I aggregated them into one level (ex. high school graduate or less) and took the proportion of that.
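The aggregation and NA filtering might be sketched as follows. The census column names here are hypothetical stand-ins for the actual education levels.

```r
# Toy census counts; column names are hypothetical.
census <- data.frame(
  pop25plus   = c(100, 200, NA),
  lessThan9th = c(10, 20, NA),
  hsNoDiploma = c(15, 10, NA),
  hsGraduate  = c(30, 70, NA)
)

# Aggregate several education levels into one and convert to a proportion.
census$propHSorLess <- (census$lessThan9th + census$hsNoDiploma +
                          census$hsGraduate) / census$pop25plus

# Keep only rows with no NA in the chosen predictor columns.
predictors <- c("propHSorLess")
knnReady <- census[complete.cases(census[, predictors, drop = FALSE]), ]
```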

Creating the factor variable

For the change from 2012 to 2016, I created a factor variable that denotes whether the county changed from “Republican to Democrat”, changed from “Democrat to Republican”, “Stayed Republican”, or “Stayed Democrat”.
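As an illustration, the four-level factor could be derived from the two years' winners like this. The input vectors and the short party codes are hypothetical.

```r
# Toy winners for four counties in each year (codes are hypothetical).
win2012 <- c("Dem", "Dem", "GOP", "GOP")
win2016 <- c("Dem", "GOP", "GOP", "Dem")

# Map short party codes to the names used in the factor labels.
partyName <- c(Dem = "Democrat", GOP = "Republican")

# Same winner in both years: "Stayed X"; otherwise "X to Y".
change <- ifelse(win2012 == win2016,
                 paste("Stayed", partyName[win2016]),
                 paste(partyName[win2012], "to", partyName[win2016]))

# Fix the level set so all four classes exist even if one is unobserved.
change <- factor(change, levels = c("Republican to Democrat",
                                    "Democrat to Republican",
                                    "Stayed Republican",
                                    "Stayed Democrat"))
```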

Creating the training and test sets

To perform K-NN, I first separated the data into training and test sets, where for each state, I randomly chose 2/3 to be in the training set and 1/3 to be in the test set. We decided to use 2-fold cross validation in order to find a good value for k, so I also split the training set into two equal halves.
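A sketch of the per-state 2/3 and 1/3 split, followed by halving the training set, on a toy table. The names training1 and training2 appear in the next section; everything else is hypothetical.

```r
set.seed(2016)

# Toy counties in three states of different sizes.
counties <- data.frame(state = rep(c("S1", "S2", "S3"), times = c(9, 6, 3)),
                       id = 1:18)

# Within each state, randomly pick about two thirds of the rows for training.
trainIdx <- unlist(lapply(split(seq_len(nrow(counties)), counties$state),
                          function(rows) {
  rows[sample.int(length(rows), size = round(2 * length(rows) / 3))]
}))

training <- counties[trainIdx, ]
test     <- counties[-trainIdx, ]

# Randomly split the training set into two halves for 2-fold cross validation.
half <- sample(rep(1:2, length.out = nrow(training)))
training1 <- training[half == 1, ]
training2 <- training[half == 2, ]
```

Sampling positions with sample.int() avoids the surprise that sample() treats a single number as a range to draw from.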

Picking a k-value using 2-fold cross validation

In order to pick a good value for k, for i from 1 to 20, I ran the knn() function with training1 as the training set and training2 as the test set and vice versa, with k=i. I found the accuracy of each of the classifications returned from knn and plotted them on a graph, along with their average, to see which value of k was the most accurate.
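The loop over k can be sketched with knn() from the class package as follows. The data are simulated and use only two of the four classes; the predictor names are hypothetical stand-ins.

```r
library(class)
set.seed(7)

# Simulated half of a training set; predictors and labels are made up.
makeToy <- function(n) {
  x <- data.frame(longitude = runif(n), propWhite = runif(n))
  y <- factor(ifelse(x$propWhite > 0.5, "Stayed Republican", "Stayed Democrat"),
              levels = c("Stayed Democrat", "Stayed Republican"))
  list(x = x, y = y)
}
training1 <- makeToy(150)
training2 <- makeToy(150)

# For each k, train on one half and predict the other, in both directions.
ks <- 1:20
accuracy <- t(sapply(ks, function(k) {
  pred12 <- knn(train = training1$x, test = training2$x, cl = training1$y, k = k)
  pred21 <- knn(train = training2$x, test = training1$x, cl = training2$y, k = k)
  c(fold1 = mean(pred12 == training2$y), fold2 = mean(pred21 == training1$y))
}))
avgAccuracy <- rowMeans(accuracy)

# Plot both folds and their average against k.
matplot(ks, cbind(accuracy, average = avgAccuracy), type = "l", lty = 1,
        xlab = "k", ylab = "Accuracy")

# Candidate k: the one with the highest average accuracy in this run.
bestK <- ks[which.max(avgAccuracy)]
```

Because the halves are drawn at random, the best k can shift between runs, which is why the curve is worth inspecting over several runs rather than trusting a single maximum.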

Evaluating the accuracy of the predictor

When the code chunk above was run several times, it appeared that 5 was a pretty good value for k because the accuracy seems to peak around there, so we used this value for the final evaluation.

## [1] 0.8256637

Final prediction

After running all of the code above several times to adjust which predictor variables and value of k we used, I ran a final prediction on all of the counties, using the same 2/3 of the data as the training data.

## [1] 0.8623469

Map K-NN Predictor Rate

The following code chunks map out where the K-NN predictor produced incorrect classifications.

Both Predictors on Map

The following code chunks map out where both predictors produced incorrect classifications.

Discussion

The predictor functions both did well in more rural Midwestern areas as well as the Appalachians, where it is likely that the demographic information provided fits well with voting patterns. On the other hand, the functions both did poorly on the East and West Coasts and in the Rust Belt area. As seen from the bar charts above, both of our predictor functions did proportionally worse for counties that were eventually won by Democrats. Since our functions rely on the 2010 census, there may have been demographic changes in the meantime that eroded the accuracy of our models. For instance, the Latino/Latina population was especially active this election season and played a prominent role in places like Florida, Arizona, Texas, New Mexico, and even California. These changes would not have been apparent from the 2012 election data, since voters are exhibiting new behavior and breaking away from our static demographic data.

There is also still a sizable, though less drastic, difference between our predictor functions’ results and the actual results when it comes to the counties won by Republicans. This is in line with the political postmortem following November 8th: the Republican candidate was able to flip many of the Democratic strongholds such as Michigan, defying previous voting patterns that our data relies on. Many of these changes were attributed to discontent among white Rust Belt working-class voters and are especially relevant for the accuracy of the K-NN function, which takes into account latitude/longitude, white population, native-born population, education levels, and income level information. Like the inaccuracies with respect to Democratic wins, these factors are outside the scope of our data and, it seems, beyond the predictive power of much of the mainstream media.

References

Packages:
  • ggplot2
  • RColorBrewer
  • knitr
  • maps
  • grid
  • gridExtra
  • rpart
  • rpart.plot
  • treeClust
  • class

Inspirations:
  • http://www.edweek.org/ew/dc/2013/gradrate_trend.html
  • https://www.washingtonpost.com/news/education/wp/2015/11/24/college-enrollment-rates-are-dropping-especially-among-low-income-students/
  • http://www.stat.wisc.edu/~gvludwig/327-5/maps#/12
  • http://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html
  • https://cran.r-project.org/web/packages/maps/maps.pdf
  • http://docs.ggplot2.org/0.9.3.1/scale_manual.html
  • http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
  • http://www.statmethods.net/advgraphs/layout.html

Reference for Texas County Populations in Maps section:
  • https://en.wikipedia.org/wiki/Kenedy_County,_Texas
  • https://en.wikipedia.org/wiki/Loving_County,_Texas
  • https://en.wikipedia.org/wiki/King_County,_Texas

Sources for Predicting the change from 2012-2016:
  • Lecture Slide 37 “Shiny” (for inspiration)
  • class package (for the knn function)

Predictor for 2016:
  • Lab8
  • Hw5